Assessing Semantic Knowledge Using
Computer-Based  and  Paper-Based  Media 


Pat-Anthony  Federico 


Navy  Personnel  Research  and  Development  Center 


Abstract  —  Using  a  within-subjects  design,  75  naval  pilots  and  flight  officers 
were  administered  computer-based  and  paper-based  tests  to  assess  semantic 
knowledge  in  order  to  determine  the  relative  reliabilities  and  validities  of  these 
two  measurement  modes.  Estimates  of  internal  consistencies,  equivalences,  and 
discriminative  validities  were  computed  for  multiple  performance  measures.  It 
was  revealed  that  the  relative  reliabilities  derived  for  these  two  assessment 
schemes  using  multivariate  measurement  criteria  were  not  significantly  different, 
and  the  discriminant  validity  of  computer-based  measures  was  superior  to  paper- 
based  measures. 


The consequences of computer-based assessment on examinees’ performance are
not obvious. The investigations that have been conducted on this topic have
produced mixed results. Some studies (D. F. Johnson & Mihal, 1973; Serwer &
Stolurow, 1970) demonstrated that test-takers do better on verbal items given by
computer than on paper-based items; however, just the opposite was found by other
studies (D. F. Johnson & Mihal, 1973; Wildgrube, 1982). One investigation
(Sachar & Fletcher, 1978) yielded no significant differences resulting from
computer-based and paper-based modes of administration on verbal items. Two studies
(English, Reckase, & Patience, 1977; Hoffman & Lundberg, 1976) demonstrated
that these two testing modes did not affect performance on memory-retrieval items.
Sometimes (D. F. Johnson & Mihal, 1973) test-takers do better on quantitative tests
when computer given, sometimes (Lee, Moreno, & Sympson, 1984) they do worse,
and other times (Wildgrube, 1982) it may make no difference. Other studies have
supported the equivalence of computer-based and paper-based administration
(Elwood & Griffin, 1972; Hedl, O’Neil, & Hansen, 1973; Kantor, 1988; Lukin,
Dowd, Plake, & Kraft, 1985). Some researchers (Evan & Miller, 1969; Koson,
Kitchen, Kochen, & Stodolosky, 1970; Lucas, Mullin, Luna, & McInroy, 1977;
Lukin et al., 1985; Skinner & Allen, 1983) have reported comparable or superior
psychometric properties of computer-based assessment relative to paper-based
assessment in clinical settings.

Investigations of computer-based presentation of personality items have yielded
reliability and validity indices comparable to typical paper-based presentation
(Katz & Dalby, 1981; Lushene, O’Neil, & Dunn, 1974). No significant differences
were found in the scores of measures of anxiety, depression, and psychological
reactance due to computer-based and paper-based administration (Lukin et al.,
1985). Studies of cognitive tests have provided inconsistent findings, with some
(Hitti, Riffer, & Stuckles, 1971; Rock & Nolen, 1982) demonstrating that the
computerized version is a viable alternative to the paper-based version. Other research
(Hansen & O’Neil, 1970; Hedl et al., 1973; D. F. Johnson & White, 1980; J. H.
Johnson & K. N. Johnson, 1981), though, indicated that interacting with a
computer-based system to take an intelligence test could elicit a considerable amount of
anxiety which could affect performance.

Regarding computerized adaptive testing (CAT), some empirical comparisons
(McBride, 1980; Sympson, Weiss, & Ree, 1982) yielded essentially no change in
validity due to mode of administration. However, test-item difficulty may not be
indifferent to manner of presentation for CAT (Green, Bock, Humphreys, Linn, &
Reckase, 1984). When going from paper-based to computer-based administration,
this mode effect is thought to have three aspects: (a) an overall mean shift where all
items may be easier or harder, (b) an item-mode interaction where a few items may
be altered and others not, and (c) a change in the nature of the task itself caused by
computer administration. A computer simulation study (Divgi, 1988) demonstrated
that a CAT version of the Armed Services Vocational Aptitude Battery had a higher
reliability than a paper-based version for these subtests: (a) general science, (b)
arithmetic reasoning, (c) word knowledge, (d) paragraph comprehension, and (e)
mathematics knowledge. The inconsistent results of mode, manner, or medium of
testing may be due to differences in methodology, test content, population tested,
or the design of the study (Lee et al., 1984).

With computer costs coming down and people’s knowledge of these systems
going up, it becomes more likely economically and technologically that many
benefits can be gained from their use. A direct advantage of computer-based testing is
that individuals can respond to items at their own pace, thus producing ideal
power tests. Some indirect advantages of computer-based assessment are
increased test security, less ambiguity about students’ responses, minimal or no
paperwork, immediate scoring, and automatic records keeping for item analysis
(Green, 1983a, 1983b). Some of the strongest support for computer-based
assessment is based upon the awareness of faster and more economical measurement
(Elwood & Griffin, 1972; D. F. Johnson & White, 1980; Space, 1981). Cory
(1977) reported some advantages of computerized over paper-based testing for
predicting job performance.

Ward (1984) stated that computers can be employed to augment what is possible
with paper-based measurement (e.g., to obtain more precise information regarding a
student than is likely with more customary measurement methods) and to assess
additional aspects of performance. He enumerated and discussed potential benefits
that may be derived from employing computer-based systems to administer
traditional tests. Some of these are as follows: (a) individualized assessment, (b)
increased flexibility and efficiency for managing test information, (c) enhanced
economic value and manipulation of measurement databases, and (d) improved
diagnostic testing. Millman (1984) agreed with Ward about computer-based
measurement encouraging individualized assessment and designing software within the
context of cognitive science. Also, what limits computer-based assessment is not so
much hardware inadequacy as incomplete comprehension of the processes
intrinsic to testing (Federico, 1980).

Simple conceptual or associative knowledge can be represented as semantic
networks (Barr & Feigenbaum, 1981). These symbolic schemes usually consist of
nodes (e.g., circles or boxes) and links (e.g., arcs or arrows) connecting the nodes.
Typically, nodes represent objects, concepts, or situations in some knowledge
domain, and links represent the relations, associations, or dependencies between
them. Semantic networks have been used as cognitive models of human memory
and representational schemes for artificial intelligence systems. These symbolic
networks are essentially universal or generic in nature. Because they are applicable
to an almost unlimited number of knowledge domains or subject matters, it makes
sense, in terms of minimizing effort and cost, to develop computer-based testing
systems that incorporate semantic networks. However, an important question remains
to be answered: How effective are these systems when compared to more
customary measurement methods? Differences between computer-based assessment
employing semantic networks and paper-based traditional testing techniques may
or may not impact upon the reliability and validity of measurement. The primary
purpose of this reported research was to shed some light on this salient issue by
evaluating empirically the relative reliability and validity of a computer-based and
a paper-based procedure for assessing semantic knowledge.


METHOD 


Subjects 

The subjects were 75 male F-14 pilots, radar intercept officers (RIOs), and
students, as well as E-2C pilots and naval flight officers (NFOs) from training and
operational squadrons at Naval Air Station (NAS) Miramar. All had volunteered to
participate in this research.

Subject  Matter 

A database was developed that consisted of five categories of facts about front-line
Soviet platforms: (a) weapons systems, (b) radar and electronic countermeasure
(ECM) systems, (c) surface and subsurface platforms, (d) airborne platforms, and
(e) counterjamming procedures. It was used to train and test the subjects
concerning important threat parameters associated with Russian platforms (e.g., aircraft
range and speed; payload of antiship missiles; typical launch altitude; missile
range, flight profile, velocity, and warheads; other weapon, radar, ECM/ECCM
[electronic counter-countermeasure] systems; surveillance capabilities).

The database was structured as a semantic network (Barr & Feigenbaum, 1981;
Johnson-Laird, 1983) in order to represent the associative knowledge inherent to it
for computer systems. That is, objects and their corresponding properties,
attributes, or characteristics were represented as node-link structures. The links
between nodes represent the associations or relationships among objects or among
objects and their attributes. For example, the object “aircraft type” and the attribute
“ECM suite” can be linked so that the system can represent a particular aircraft
type that has a certain ECM suite. By defining initially all objects and attributes in
the database, a hierarchy or tree structure can be specified for all objects, attributes,
and their relationships. Once a database was structured as a semantic network, it
became possible for independent software modules to interact with, operate upon,
or manipulate the database. For example, interpretative programs could ask
questions about the database, since its intrinsic structure was represented. This latter
capability was capitalized upon in this research.
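
The node-link structure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the representation used in the study: the class name `SemanticNetwork`, its methods, and the platform and suite names (`Bomber-A`, `Suite-1`) are all hypothetical.

```python
from collections import defaultdict


class SemanticNetwork:
    """Minimal node-link store: (object, relation) -> set of linked values."""

    def __init__(self):
        self._links = defaultdict(set)

    def add_link(self, obj, relation, value):
        # A link records that `obj` stands in `relation` to `value`.
        self._links[(obj, relation)].add(value)

    def values(self, obj, relation):
        # Attribute values attached to one object (empty list if none).
        return sorted(self._links.get((obj, relation), set()))

    def objects_with(self, relation, value):
        # Inverse lookup: every object linked to `value` through `relation`.
        return sorted(
            o for (o, r), vals in self._links.items()
            if r == relation and value in vals
        )


# Hypothetical entries, for illustration only.
net = SemanticNetwork()
net.add_link("Bomber-A", "carries", "XYZ-123")
net.add_link("Bomber-B", "carries", "XYZ-123")
net.add_link("Bomber-A", "ECM suite", "Suite-1")

print(net.values("Bomber-A", "ECM suite"))
print(net.objects_with("carries", "XYZ-123"))
```

Because the structure itself is stored, a separate program can query it in either direction without knowing anything about the subject matter, which is the property the text relies on.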

Computer-Based  Assessment 

A computer-based game or test, FlashCards (Liggett & Federico, 1986), was
adopted and adapted to quiz students and instructors as well as crew members of other
operational squadrons about the threat-parameter database. This computer-based
quiz is totally autonomous or independent of the database and will run on any
database structured as a semantic network. It randomly selects objects or chooses
characteristics from the database, and generates questions about threat platforms or
their salient attributes. Unlike some computer-based tests, alternative forms did not
have to be specifically or previously programmed as such.
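
How a quiz program might generate items from any database stored this way can be sketched as follows. The sketch assumes a dictionary keyed by (object, relation) pairs; the function `make_question`, the question wording, and every entry in `NETWORK` are hypothetical and are not taken from FlashCards itself.

```python
import random

# Hypothetical network: (object, relation) -> set of attribute values.
NETWORK = {
    ("Bomber-A", "carries"): {"XYZ-123"},
    ("Bomber-B", "carries"): {"XYZ-123", "XYZ-456"},
    ("Bomber-A", "ECM suite"): {"Suite-1"},
}


def make_question(network, rng):
    """Pick one link at random and phrase it forwards or backwards."""
    links = sorted(network.items(), key=lambda kv: kv[0])
    (obj, relation), values = rng.choice(links)
    if rng.random() < 0.5:
        # Forward: ask for the attribute values of the chosen object.
        prompt = f"What does {obj} have for '{relation}'?"
        answers = set(values)
    else:
        # Backward: ask which objects share one chosen attribute value.
        value = rng.choice(sorted(values))
        prompt = f"Which objects have '{relation}' = {value}?"
        answers = {
            o for (o, r), vals in network.items()
            if r == relation and value in vals
        }
    return prompt, answers


rng = random.Random(0)  # seeded only to make the illustration repeatable
for _ in range(3):
    print(make_question(NETWORK, rng))
```

Since every item is drawn at random from the stored links, each run yields a different set of questions, so parallel forms do not need to be written by hand.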

FlashCards is analogous to using real flash cards. That is, a question is presented
to individual students who are expected to answer it. Questions can have multiple
answers as with “What Soviet bombers carry the XYZ-123 missile?” After
individual students are presented with the question, they are allowed as many tries as they
would like to type in the answer. If the students cannot answer the question, they
can continue with the quiz. At this point, they are provided feedback in terms of
the correct answer or answers. At any point in the answering process, they can
continue to the next question. For each answer, the students must key in a response
which reflects their degree of confidence in their answer. Also, for each answer the
student’s response latency is recorded and displayed.

FlashCards quizzed the students on all top-level, or general, categories of the
semantic network that it was using as the database. The score for each question
was computed as the number of correct answers entered divided by the total
number of answers entered. For the purposes of this research, a FlashCards test
consisted of 25 completion or fill-in-the-blank domain-referenced items or questions.
These were considered as two groups of 12 odd and even items each (dropping the
last question) for computing split-half reliability estimates. The average score for
odd (even) items was calculated as the total score of odd (even) items divided by
the number of odd (even) questions attempted. The total computer-based test score
was calculated as the average of the odd and even halves.
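
The scoring rules just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, repeated entries of the same answer are counted once, and an item that was not attempted is marked `None`.

```python
def item_score(entered, correct):
    """Proportion of the answers entered that are correct (0.0 if none entered)."""
    entered = set(entered)
    return len(entered & set(correct)) / len(entered) if entered else 0.0


def half_scores(item_scores):
    """Return (odd mean, even mean, total) from per-item scores in presentation order.

    `None` marks an item that was not attempted (an assumption of this sketch).
    """
    usable = item_scores[:24]  # drop the 25th item so each half holds 12 items

    def mean_attempted(scores):
        attempted = [s for s in scores if s is not None]
        return sum(attempted) / len(attempted) if attempted else 0.0

    odd = mean_attempted(usable[0::2])   # items 1, 3, 5, ...
    even = mean_attempted(usable[1::2])  # items 2, 4, 6, ...
    return odd, even, (odd + even) / 2


# Two answers entered, one of them correct.
print(item_score(["XYZ-123", "XYZ-999"], ["XYZ-123", "XYZ-456"]))
```

Dividing by the number of answers entered, rather than by the number of correct answers available, means that guessing extra answers lowers the item score.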

Paper-Based  Assessment 

Two alternative forms of a paper-based test were designed and developed to assess
knowledge of the same threat-parameter database mentioned above, and to mimic
as much as possible the format used by FlashCards. Both of these consisted of 25
completion or fill-in-the-blank domain-referenced items or questions. As with the
computer-based test, more than one answer may be required per item or question.
Beneath each question was a confidence scale that resembled the one used in
FlashCards where the test-takers were required to indicate the level of confidence
in their response(s). Scoring items for this paper-based test was similar to scoring
the computer-based test: For each question, the number of correct answers given
was divided by the total number of answers completed for that question. Also,
scoring odd (even) halves of the test for computing internal consistency was
similar to that for FlashCards. The score for the total paper-based test was calculated
like the total score for the computer-based test.

Procedure 

Subjects  acquired  threat-parameter  knowledge  using  dual  media:  (a)  a  traditional 
text  organized  according  to  the  database’s  major  topics  and  (b)  a  computer-based 
system  consisting  of  the  quizzes  FlashCards  and  Jeopardy.  Mode  of  assessment, 
computer-based  or  paper-based,  was  manipulated  as  a  within-subjects  variable 
(Kirk,  1968).  Subjects  were  administered  the  computer-based  and  paper-based  tests 
in  counterbalanced  order.  The  two  forms  of  the  paper-based  tests  were  alternated  in 
their  administration  to  subjects.  After  subjects  received  either  the  computer-based 
or  paper-based  test,  they  were  immediately  administered  the  other.  It  was  assumed 
that  a  subject’s  state  of  threat-parameter  knowledge  was  the  same  during  the 
administration  of  both  tests.  Subjects  took  approximately  10-15  min  to  complete 
the  paper-based  test,  and  20-25  min  to  complete  the  computer-based  test.  The 
longer  time  to  complete  the  latter  test  was  largely  attributed  to  lack  of  typing  or 
keyboard  proficiency  on  the  part  of  some  of  the  subjects.  The  manner  in  which  the 
subject matter was presented during assessment within the computer-based and
paper-based  media  was  essentially  the  same,  due  to  similar  symbol  systems  and 
presentation  formats  being  employed. 
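
One way to realize such a counterbalanced schedule is sketched below. The source states only that test order was counterbalanced and that the two paper forms were alternated; the labels, the function `assign`, and the decision to cross order with form in a four-cell rotation are assumptions made for illustration.

```python
from itertools import product

# Hypothetical labels for the two administration orders and two paper forms.
ORDERS = [("computer", "paper"), ("paper", "computer")]
FORMS = ["A", "B"]
CELLS = list(product(ORDERS, FORMS))  # four order-by-form combinations


def assign(subject_index):
    """Cycle through the four cells so each is used about equally often."""
    return CELLS[subject_index % len(CELLS)]


for i in range(4):
    print(i, assign(i))
```

Rotating through all four cells keeps order effects and form effects from being confounded with the comparison between media.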

Reliabilities for both modes of testing were estimated by deriving internal
consistency indices using an odd-even item split. These reliability estimates were
adjusted by employing the Spearman-Brown Prophecy Formula (Thorndike,
1982). Reliability estimates were calculated for test score, average degree of
confidence, and average response latency for the computer-based test, but only for test
score and average degree of confidence for the paper-based test. Response latency
was not measured for the paper-based test. Equivalences between the two modes of
assessment were estimated by Pearson product-moment correlations for total test
score and average degree of confidence. These correlations were considered
indices of the extent to which the two types of testing were measuring the same
semantic knowledge and amount of assurance in answers.
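
The split-half procedure can be sketched as follows: correlate the odd-half and even-half scores, then step the half-test correlation r up to full test length with the Spearman-Brown formula, 2r / (1 + r). The function names and the five pairs of scores are made up for illustration and are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


# Made-up odd-half and even-half scores for five examinees.
odd = [55.0, 62.0, 40.0, 71.0, 48.0]
even = [50.0, 66.0, 45.0, 68.0, 52.0]
r_half = pearson_r(odd, even)
print(round(r_half, 2), round(spearman_brown(r_half), 2))
```

The same `pearson_r` function, applied to computer-based and paper-based totals for the same examinees, gives the equivalence estimate described above.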

In order to derive discriminant validity estimates, research subjects were placed
into four groups: those above or below F-14 or E-2C mean flight hours. One stepwise
multiple discriminant analysis, using Wilks’ criterion for including and rejecting
variables and their associated statistics, was computed to ascertain how well
computer-based and paper-based measures distinguished among the defined groups that were
expected to differ in the extent of their knowledge of the threat-parameter database. It
was thought that mean flight hours reflect operational experience. Those individuals
with more operational experience were expected to perform better on tests of
threat-parameter knowledge than those with less experience. Also, F-14 crew members
were expected to have more knowledge of specific threat parameters than E-2C crew
members since the former must be more intimately familiar with these attributes in
order to make successful intercepts than the latter.
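
The grouping rule can be sketched as follows. The sketch assumes that each subject is compared with the mean flight hours of his own aircraft community and that a subject exactly at the mean falls in the "Below" group; neither detail is stated in the source, and the flight hours shown are made up.

```python
from statistics import mean


def assign_groups(subjects):
    """subjects: list of (aircraft, flight_hours) pairs; returns one label per subject."""
    aircraft_types = {aircraft for aircraft, _ in subjects}
    community_mean = {
        ac: mean(hours for aircraft, hours in subjects if aircraft == ac)
        for ac in aircraft_types
    }
    # Assumption: a subject exactly at the mean is placed in the "Below" group.
    return [
        f"{'Above' if hours > community_mean[aircraft] else 'Below'} {aircraft}"
        for aircraft, hours in subjects
    ]


# Made-up flight hours, for illustration only.
sample = [("F-14", 1200), ("F-14", 400), ("E-2C", 900), ("E-2C", 300)]
print(assign_groups(sample))
```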


RESULTS 

Reliability  and  Equivalence  Estimates 

Split-half reliability and equivalence estimates of computer-based and paper-based
measures from the pooled within-groups correlation matrices for the different
groupings are tabulated in Table 1. It can be seen that the adjusted reliability
estimates of the computer-based and paper-based measures are moderate to high,
ranging from .74 to .97. None of the differences in corresponding reliabilities for
computer-based and paper-based measures, test score and average degree of
confidence, was found to be statistically significant (p > .01) using a test described by
Edwards (1964). This suggests that the computer-based and paper-based measures
were not significantly different in reliability or internal consistency.

Considering the computer-based measures, it was ascertained that the reliability
estimate for average degree of confidence was significantly (p < .01) higher than
the reliability estimates for average response latency and test score. Also, the
reliability estimate for response latency was significantly higher than the one computed
for test score. Focusing on the paper-based measures, it was found that the
reliability estimate for average degree of confidence was significantly (p < .01) higher than
the reliability estimate for test score. These results implied that these measures can
be ranked in order of their internal consistencies from highest to lowest as follows:
average degree of confidence, average response latency, and test score.

Equivalence estimates for test score and average degree of confidence measures,
respectively, were .76 and .82. These suggest that the computer-based and
paper-based measures had anywhere from approximately 58 to 67% variance in common,
implying that these different modes of assessment were somewhat or partially
equivalent. The equivalences for test score and average degree of confidence
measures were not significantly (p > .01) different.
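
The shared-variance figures follow from squaring each equivalence correlation, as this short check shows. The variable names are illustrative; the two correlations are the ones reported in the text.

```python
# Equivalence (cross-mode) correlations reported in the text.
equivalence = {"test score": 0.76, "confidence": 0.82}

# Squaring a correlation gives the proportion of variance the two measures share.
shared_percent = {name: round(100 * r ** 2, 1) for name, r in equivalence.items()}
print(shared_percent)
```

The results are close to 58% and 67%, which matches the range stated above.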

Discriminant  Validity  Estimates 

The discriminant analysis computed to determine how well computer-based and
paper-based measures differentiated groups defined by above or below F-14 or
E-2C mean flight hours yielded one significant discriminant function. According to
the multiple discriminant analysis model (Cooley & Lohnes, 1962; Tatsuoka, 1971;
Van de Geer, 1971), the maximum number of derived discriminant functions is
either one less than the number of groups or equal to the number of discriminating
variables, whichever is smaller. Since there were four groups to be discriminated,
this analysis yielded three discriminant functions, but only one of them was
significant. Consequently, only this significant discriminant function and its associated
statistics are presented.

Table 1. Split-Half Reliability and Equivalence Estimates
of Computer-Based and Paper-Based Measures for
Semantic Knowledge

                      Reliability
Measure      Computer-Based   Paper-Based   Equivalence
Score             .74             .76           .76
Confidence        .96             .97           .82
Latency           .88             —             —

Note. Split-half reliability estimates were adjusted by employing
the Spearman-Brown Prophecy Formula.

The statistics associated with the significant function, standardized discriminant-
function coefficients, pooled within-groups correlations between the function and
computer-based and paper-based measures, and group centroids for above or below
F-14 or E-2C mean flight hours are presented in Table 2. It can be seen that the
single significant discriminant function accounted for approximately 82% of the
variance among the four groups. The discriminant-function coefficients that consider
the interrelationships or interdependencies among the multivariate measures
revealed the relative contribution or comparative importance of these variables in
defining this derived dimension to be the paper-based total score (PTS), the
computer-based total score (CTS), and the computer-based total average degree of
confidence (CTC), respectively. The computer-based total average latency (CTL) and
the paper-based total average degree of confidence (PTC) were considered
unimportant in specifying this discriminant function since the absolute values of their
coefficients were each below .4. The within-groups correlations that are computed
for each individual measure partialling out the interrelationships of all the other
variables indicated that the major contributors to the significant discriminant
function were CTC, CTS, and CTL, respectively, all computer-based measures. The
group centroids showed how the performance of the F-14 crew members clustered
together along one end of the derived dimension, while the performance of the
E-2C crew members clustered together along the other end of the continuum. The
means and standard deviations for groups above or below F-14 or E-2C mean flight
hours, univariate F ratios, and levels of significance for computer-based and
paper-based measures are tabulated in Table 3. Considering the measures as univariate
variables (that is, independent of their multivariate relationships with one another),
these statistics revealed that the three computer-based measures CTC, CTS, and
CTL, respectively, significantly differentiated the four groups, not the paper-based
measures, PTS and PTC. Applying Duncan’s multiple range test (Kirk, 1968) on
the group means for the important individual measures indicated that F-14 crews
significantly (p < .05) outperformed E-2C crews on CTS, CTC, and CTL. The
multivariate and subsequent univariate results established the discriminant validity of
computer-based measures to be superior to that of paper-based measures.


Table 2. Statistics Associated With Significant Discriminant Function, Standardized
Discriminant-Function Coefficients, Pooled Within-Groups Correlations Between
the Discriminant Function and Computer-Based and Paper-Based Measures, and
Group Centroids for Above or Below F-14 or E-2C Mean Flight Hours

Discriminant Function

Eigenvalue   Percent    Canonical     Wilks'    Chi-Square   df     p
             Variance   Correlation   Lambda
   .44        82.43        .55          .64       31.38      15    .008

          Discriminant   Within-Group
Measure   Coefficient    Correlation    Group                    Centroid
CTS            .91            .51       Above F-14 Mean Hours       .10
CTC            .84            .57       Below F-14 Mean Hours       .39
CTL           -.24           -.45       Above E-2C Mean Hours     -1.35
PTS          -1.19           -.00       Below E-2C Mean Hours     -1.50
PTC           -.17            .36

Note. CTS = computer-based total test score; CTC = average degree of confidence; CTL = average
response latency; PTS = paper-based total test score; PTC = average degree of confidence. CTL was
measured in seconds.

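
The entries in the top panel of Table 2 are related by standard formulas, and the relations can be checked with a short sketch. The eigenvalue (.44) and Wilks' lambda (.64) are read from the table with their decimal points restored, which is an assumption; the chi-square line uses Bartlett's approximation, a common choice that the source does not name.

```python
from math import log, sqrt

n_subjects, n_groups, n_measures = 75, 4, 5

# At most min(groups - 1, measures) discriminant functions can be derived.
max_functions = min(n_groups - 1, n_measures)

# Values read from Table 2 (decimal points restored; treat as assumptions).
eigenvalue, wilks_lambda = 0.44, 0.64

# Canonical correlation implied by the eigenvalue of a discriminant function.
canonical_r = sqrt(eigenvalue / (1 + eigenvalue))

# Bartlett's chi-square approximation for Wilks' lambda, with its degrees of freedom.
chi_square = -(n_subjects - 1 - (n_measures + n_groups) / 2) * log(wilks_lambda)
df = n_measures * (n_groups - 1)

print(max_functions, round(canonical_r, 2), round(chi_square, 2), df)
```

The canonical correlation and degrees of freedom agree with the table. The recomputed chi-square comes out near 31, slightly under the tabled 31.38, a gap that rounding lambda to two decimals could plausibly explain.
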
Table 3. Means and Standard Deviations for Groups Above or Below F-14 or E-2C Mean
Flight Hours, Univariate F Ratios, and Levels of Significance for Computer-Based and
Paper-Based Measures

                                      Group
                Above F-14     Below F-14     Above E-2C     Below E-2C
                Flight Hours   Flight Hours   Flight Hours   Flight Hours
Measure          (n = 26)       (n = 37)       (n = 5)        (n = 7)         F       p
CTS      M        60.58          59.62          44.60          43.14         2.94    .039
         SD       15.75          18.77          15.68          17.37
CTC      M        75.58          80.84          48.60          64.57         4.11    .010
         SD       21.57          19.80          21.23          26.48
CTL      M         8.42           7.81           9.49          11.06         2.28    .087
         SD        3.31           2.77           4.10           3.94
PTS      M        51.65          49.73          45.80          52.86          .19    .900
         SD       18.26          20.38          11.86          13.91
PTC      M        72.23          76.70          53.0[?]        69.71         2.14    .103
         SD       23.02          18.10        [illegible]      20.94
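
The univariate F ratios in Table 3 can be rebuilt from the group sizes, means, and standard deviations alone. The sketch below does this for CTS and CTC. It assumes that the fourth group contains 7 subjects, a figure inferred as 75 − 26 − 37 − 5 and not read directly from the table; the function name is hypothetical.

```python
def f_from_summary(ns, means, sds):
    """One-way ANOVA F ratio rebuilt from group sizes, means, and standard deviations."""
    n_total = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    df_between, df_within = len(ns) - 1, n_total - len(ns)
    return (ss_between / df_between) / (ss_within / df_within)


ns = [26, 37, 5, 7]  # assumption: the last n is inferred as 75 - 26 - 37 - 5
f_cts = f_from_summary(ns, [60.58, 59.62, 44.60, 43.14], [15.75, 18.77, 15.68, 17.37])
f_ctc = f_from_summary(ns, [75.58, 80.84, 48.60, 64.57], [21.57, 19.80, 21.23, 26.48])
print(round(f_cts, 2), round(f_ctc, 2))
```

Both recomputed ratios fall within rounding distance of the tabled values of 2.94 and 4.11, which supports the inferred group size.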

DISCUSSION 


This study established that (a) computer-based and paper-based measures, test
score and average degree of confidence, are not significantly different in
reliability or internal consistency; (b) for computer-based and paper-based measures,
average degree of confidence has a higher reliability than test score; (c) the
equivalence estimates for computer-based and paper-based measures (test score
and average degree of confidence) were not significantly different; and (d) the
discriminant validity of the computer-based measures was superior to
paper-based measures.

The finding that computer-based and paper-based measures, test score and
average degree of confidence, were not significantly different in reliability or internal
consistency partially agrees with the corresponding result established in a study by
Federico (1991). In that research, computer-based and paper-based measures of test
scores for recognition of aircraft silhouettes were found to be equally reliable;
however, the computer-based measure of average degree of confidence was found to be
less reliable than its paper-based counterpart. The present study suggested that
equivalence estimates for computer-based and paper-based measures, test score and
average degree of confidence, were dissimilar in magnitude. This finding is similar
to that established in the Federico (1991) study where computer-based and
paper-based measures of test score were less equivalent than these measures of average
degree of confidence. Lastly, some of the results of the present research
demonstrated that the discriminative validity of the computer-based measures was superior to
paper-based measures. This finding is in partial agreement with that found in the
Federico (1991) research where this was also established with respect to some
statistical criteria. However, according to other criteria the discriminative validities of
computer-based and paper-based measures were about the same.

Computer-based and paper-based media vary in the nature of the reciprocal
interaction and information feedback they provide to individuals during learning or
testing. Focusing on assessment, computer-based media usually provide more
immediate interaction and information feedback to the test-taker than paper-based
media. Typically, the computer system presents a question that an individual
attempts to answer, and the quality of the response is immediately displayed. Or, in
computerized adaptive or tailored testing the system presents a test item and, based
upon the correctness of the individual’s response, then provides either a more or a
less difficult follow-on item. That is, the computer-based system is interactive to
the degree that it is designed to tailor, or adapt, the level of difficulty of the
administered items as a function of test-takers’ responses. In these contexts, the direct
interaction or continuous transaction between the test-taker and the system is
intrinsic to the establishment of the feedback loop, which was the case with the
computer-based assessment system used in this reported research.
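
The tailoring idea can be illustrated with a deliberately simple up-down rule: one level harder after a correct response, one level easier after an incorrect one. This is a sketch only. Operational adaptive tests select items with item-response-theory methods, the function name and the 1-to-10 difficulty scale are hypothetical, and FlashCards as described above did not adjust difficulty in this way.

```python
def next_difficulty(current, was_correct, lowest=1, highest=10):
    """Move one level harder after a correct response, one level easier otherwise."""
    step = 1 if was_correct else -1
    return max(lowest, min(highest, current + step))


level = 5
for was_correct in [True, True, False, True]:
    level = next_difficulty(level, was_correct)
print(level)
```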

Three  distinct  functions  have  been  attributed  to  feedback,  namely:  (a)  reinforce¬ 
ment,  (b)  information,  and  (c)  motivation  (Bilodeau,  1966).  Each  of  these  three 
attributes  is  more  apparent  in  computer-based  than  in  paper-based  assessment. 
Computer-based  testing  or  gaming  systems  can  be  designed  to  reinforce  directly 
correct  re.sponses  by  awarding  a  number  of  points  to  the  te.st-taker  or  player.  It  is 
difficult,  though  not  impossible,  for  paper-based  testing  to  match  the  promptness  of 
the  reinforcement  provided  by  computer-based  testing.  Usually,  the  immediacy  of 
the  information  provided  by  a  computer-based  system  concerning  the  correctness 
or  incorrectness  of  a  response  to  a  test  item  exceeds  that  provided  by  a  paper-based 
system. Partly due to the almost simultaneous administration of reinforcement and
display of information as a direct consequence of responding, the level of motivation
typically elicited by a computer-based system should surpass that aroused by a
paper-based  system.  Also,  some  computer-based  quizzes  are  essentially  game-like 
in  nature,  like  the  ones  employed  in  this  reported  research  and  in  Federico’s  (1991) 
study.  The  incentive  provided  by  assessment  systems  such  as  these  approaches  that 
of  video  games  where  players  attempt  to  outperform  one  another  by  maximizing 
their individual payoffs. The desire to establish a personal best, to surpass the others,
or to be in the top ten is instilled by using some well-designed computer-based
testing systems and/or because of the mere fact that individual performance can be
visible to others when interacting with this video game-like technology, thus eliciting
socially motivated competitive behavior. This desire testifies to the typically
higher level of engagement experienced by people when employing computer-based
rather than paper-based assessment systems.

Computer-based testing or gaming systems usually have as an integral component
video display terminals that are similar to television screens. Consequently, people
possibly perceive, expect, or anticipate a priori that assessment systems such as
these may be more engaging, engrossing, or entertaining than paper-based tests or
games, regardless of the subject-matter domain. That is, personal perceptions or
expectations concerning the measurement system as well as the assessment situation
predispose how tests are taken by individuals. Within the current Zeitgeist, it seems
reasonable to expect that people generally have more positive attitudes toward
computer-based than paper-based media, partly because of the perceived higher
entertainment potential of the former. The associative, affective, and active tendencies
attributed to these stronger positive attitudes may culminate in people perceiving
computer-based media as more absorbing and attractive than paper-based media.
That is, in this high-technology era individuals seem more interested in, or inclined
toward, attending to or heeding computer-based rather than paper-based media.

Extrapolating from this implicit framework, or engagement theory, it was
expected that the computer-based test used in this study would have higher
reliabilities and validities than the paper-based test, regardless of the measurement criteria
employed, because the former should have provided more immediate interaction
and information feedback, instilled a higher level of motivation, and engaged
individuals to a greater extent than the latter. More interaction, feedback, motivation,
or engagement evoked by the computer-based test should have encouraged or
exhorted individuals to exert themselves maximally and continuously during
measurement. That is, subjects were expected to amplify their
respective performances, because of the greater expected engagement elicited in
them, when interacting with a computer-based rather than a paper-based test.
Subjects should have consistently or continuously sustained their maximum efforts
throughout the entire computer-based test, culminating in possibly less response
variability, and consequently more reliability, than on the paper-based test. Higher
reliability, in turn, should have resulted in higher discriminative validity for
computer-based than for paper-based measurement. This was not entirely
established empirically by this reported research. Contrary to what was implicitly
expected, this study demonstrated that the reliabilities of computer-based and paper-based
tests are not significantly different. Compatible with the presumed framework,
however, this investigation found that computer-based measures had validity
superior to paper-based measures.
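
The expectation that higher reliability should permit higher validity can be related to a standard result of classical test theory, which the text itself does not state. The notation below is an assumption introduced for illustration: X is a test score, Y a criterion, r_{XY} the observed validity coefficient, and r_{XX'} and r_{YY'} the reliabilities of the test and the criterion. The result is stated for correlational validity coefficients, so it applies to the discriminative validity indices of this study only by analogy.

```latex
\[
  r_{XY} \;\le\; \sqrt{r_{XX'}\, r_{YY'}}
\]
```

The inequality sets only a ceiling. Greater reliability allows, but does not guarantee, greater validity, so two tests with similar reliabilities can still differ in validity.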

Hofer and Green (1985) were concerned that computer-based assessment would
introduce irrelevant or extraneous factors that would likely degrade test
performance. These computer-correlated factors may alter the nature of the task to such a
degree that it would be difficult for a computer-based test and its paper-based
counterpart to measure the same construct or content. This could impact upon
reliability, validity, and normative data, as well as other assessment attributes. Several
plausible reasons, they stated, may contribute to different performances on these
distinct kinds of testing: (a) state anxiety instigated when confronted by computer-based
testing, (b) lack of computer familiarity on the part of the test-taker, and (c)
changes in response format required by the two modes of assessment. These
different dimensions could result in tests that are nonequivalent; however, in this
reported research these diverse factors had no apparent impact.

On the other hand, there are a number of known differences between computer-based
and paper-based assessment that may affect equivalence and validity
(Green, 1986):

1. Passive omitting of items is usually not permitted on computer-based tests. An
individual must respond, unlike with most paper-based tests.

2. Computerized tests typically do not permit backtracking. The test-taker cannot
easily review items, alter responses, or delay answering questions.

3. The capacity of the computer screen can have an impact on what usually are long
test items (e.g., paragraph comprehension). These may be shortened to accommodate
the computer display, thus partially changing the nature of the task.

4. The quality of computer graphics may affect the comprehension and degree of
difficulty of the item.

5. Pressing a key or using a mouse is probably easier than marking an answer
sheet. This may impact upon the validity of speeded tests.

6. The computer typically displays items individually; traditional time limits are no
longer necessary.

Assuming that these abstract distinctions may affect the equivalence and validity
of computer-based and paper-based assessment, the omission of items and
backtracking on paper-based tests were not permitted in this research in order to simulate
computer-based tests. Computer screen capacity was of no consequence in this
study since none of the test items was long. Graphics were not employed in the
paper-based test or its computer-based counterpart and consequently played no part
in item comprehension or difficulty. In this study neither the computer-based nor
the paper-based measurement employed true speeded tests. Also, to mimic the
individual display of items on the computer-based tests, the subjects were closely
monitored as they took the paper-based test, and were reminded to expedite their
responses without retracing.

When evaluating or comparing different media for instruction and assessment,
one must keep in mind that the newer medium may simply be seen as more
interesting, engaging, and challenging by the students. This novelty effect seems to
disappear as rapidly as it appears. However, in research studies conducted over a
relatively short time span, for example, a few days or months at the most, this effect
may still linger and affect the evaluation by its enhancement of the impact of the
more novel medium (Colvin & Clark, 1984), which could have occurred in this
reported research. When matching media to distinct subject matters, course
contents, or core concepts, some research evidence (Jamison, Suppes, & Welles, 1974)
indicates that, other than in obvious cases, just about any medium will be effective
for different content. Extrapolating this notion to the measurement domain, the
validity results of this study seemed to suggest, contrary to the above, that different
media may be differentially effective for testing the same subject matter.